
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning

Gong, Haozhen, Ji, Xiaozhong, Liu, Yuansen, Wu, Wenbin, Yan, Xiaoxiao, Liu, Jingjing, Wu, Kai, Pan, Jiazhen, Jian, Bailiang, Zhang, Jiangning, Hu, Xiaobin, Li, Hongwei Bran

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR is distinguished from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and the leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.
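As a rough illustration of the benchmark's MCQ scoring mode, the sketch below computes accuracy from a model's free-text replies. The model callable, the item schema, and the letter-extraction rule are assumptions for illustration, not Med-CMR's actual evaluation harness.

    import re

    def mcq_accuracy(model, items):
        """Score multiple-choice VQA items by extracting the first option
        letter from the model's reply and comparing it to the gold letter."""
        correct = 0
        for item in items:  # assumed fields: image, question, options, gold
            reply = model(item["image"], item["question"], item["options"])
            match = re.search(r"\b([A-E])\b", reply)  # first option letter
            if match and match.group(1) == item["gold"]:
                correct += 1
        return correct / len(items)

Open-ended answers would instead be graded against references (e.g., by an LLM judge), which is why the paper reports a separate open-ended score.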


ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation

Gao, Jing, Luo, Shutiao, Liu, Yumeng, Li, Yuanming, Zeng, Hongji

arXiv.org Artificial Intelligence

With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To address this need, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), specifically designed for downstream business scenarios across prevalent domains including academia, education, finance, law, medical treatment, and news. ChiMDQA encompasses long-form documents from six distinct fields, consisting of 6,068 rigorously curated, high-quality question-answer (QA) pairs further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset guarantees both diversity and high quality, rendering it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. Additionally, this paper offers a comprehensive overview of the dataset's design objectives, construction methodology, and fine-grained evaluation system, supplying a substantial foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.


CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Jin, Qiangguo, Zheng, Xianyao, Cui, Hui, Sun, Changming, Fang, Yuqi, Cong, Cong, Su, Ran, Wei, Leyi, Xuan, Ping, Wang, Junbo

arXiv.org Artificial Intelligence

Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention-based methods struggle to handle cross-modal semantic alignment between vision and language effectively. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification prevents adaptation to the diversity of free-form answers and overlooks their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via interleaved feature representations. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability on open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
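The multi-task idea behind FFAE can be sketched as a weighted combination of a closed-set answer-classification loss and a token-level generation loss over free-form answers. The weighting, tensor shapes, and padding convention below are assumptions, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def multitask_vqa_loss(cls_logits, cls_target, gen_logits, gen_target,
                           alpha=0.5):
        """Combine answer classification with free-form answer generation.

        cls_logits: (batch, num_answers)   closed-set answer scores
        cls_target: (batch,)               gold answer indices
        gen_logits: (batch, seq, vocab)    token logits for free-form answers
        gen_target: (batch, seq)           gold token ids, -100 = padding
        """
        loss_cls = F.cross_entropy(cls_logits, cls_target)
        loss_gen = F.cross_entropy(
            gen_logits.reshape(-1, gen_logits.size(-1)),
            gen_target.reshape(-1),
            ignore_index=-100,  # skip padded positions
        )
        # alpha trades off the two objectives; its value here is illustrative.
        return alpha * loss_cls + (1.0 - alpha) * loss_gen

The auxiliary generation term is what lets the classifier-style model still learn from the detailed semantics of free-form answers.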


Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

Xiao, Yunpeng, Yang, Carl, Mai, Mark, Hu, Xiao, Shu, Kai

arXiv.org Artificial Intelligence

Large language models (LLMs) show promise for clinical use and are often evaluated using datasets such as MedQA. However, many such datasets rely on simplified question answering (Q&A) that underrepresents real-world clinical decision-making. Motivated by this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, task difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. We then review methods for clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.
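To make the two-dimensional paradigm concrete, one can place a benchmark on ordinal scales for how realistic its clinical background and its clinical question are. The specific levels below are illustrative stand-ins, not the paper's taxonomy.

    from dataclasses import dataclass

    @dataclass
    class ClinicalTask:
        """Position of a benchmark along the two proposed dimensions."""
        background: int  # 0 = curated vignette ... 3 = raw EHR-like context
        question: int    # 0 = single-answer MCQ ... 3 = open management decision

        def difficulty(self) -> int:
            # Difficulty grows as both axes approach the real clinical setting.
            return self.background + self.question

    medqa_style = ClinicalTask(background=0, question=0)  # simplified Q&A
    real_world = ClinicalTask(background=3, question=3)
    assert real_world.difficulty() > medqa_style.difficulty()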


Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

Ma, Bolei, Cao, Yong, Sen, Indira, Haensch, Anna-Carolina, Kreuter, Frauke, Plank, Barbara, Hershcovich, Daniel

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.


Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Jo, Hwiyeol, Lee, Joosung, Lee, Jaehone, Lee, Sang-Woo, Park, Joonsuk, Yoo, Kang Min

arXiv.org Artificial Intelligence

Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on the probability of answer choices. For models that require reasoning, however, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. To mitigate this, we propose a simple framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach improves performance and robustness. Furthermore, we apply the framework to general math problems and open-ended question-answering tasks. Our analysis and framework could offer more reliable results for model evaluation.
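A minimal sketch of the Answer Regeneration step as described, assuming a generic generate(prompt) -> str completion function: the model's reasoning output is fed back together with the original input, prefaced by "Answer:", so the final answer is read from the regenerated output instead of being parsed with brittle hand-written extraction rules.

    def answer_regeneration(generate, question: str) -> str:
        """Two-pass answering: reason first, then regenerate the final answer."""
        reasoning = generate(question)  # first pass: free-form reasoning
        # Second pass: prior input and output, prefaced by the prompt "Answer:".
        followup = f"{question}\n{reasoning}\nAnswer:"
        return generate(followup).strip()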


Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset

Alwajih, Fakhraddin, Magdy, Samar M., Mekki, Abdellah El, Nacar, Omer, Nafea, Youssef, Abdelfadil, Safaa Taher, Yahya, Abdulfattah Mohammed, Luqman, Hamzah, Almarwani, Nada, Aloufi, Samah, Qawasmen, Baraah, Atou, Houdaifa, Sibaee, Serry, Alsayadi, Hamzah A., Al-Dhabyani, Walid, Al-shaibani, Maged S., Aatar, Aya El, Qandos, Nour, Alhamouri, Rahaf, Ahmad, Samar, Al-Ghrawi, Mohammed Anwar, Yacoub, Aminetou, AbuHweidi, Ruwa, Lemin, Vatimetou Mohamed, Abdel-Salam, Reem, Bashiti, Ahlam, Alansari, Aisha, Ashraf, Ahmed, Alturayeif, Nora, Inciarte, Alcides Alcoba, Ammar, Adel, Elmadany, Abdelrahim A., Tourad, Mohamedou Cheikh, Berrada, Ismail, Jarrar, Mustafa, Shehata, Shady, Abdul-Mageed, Muhammad

arXiv.org Artificial Intelligence

Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.


GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Hutson, Dylan, Vennemeyer, Daniel, Deshmukh, Aneesh, Zhan, Justin, Jiang, Tianyu

arXiv.org Artificial Intelligence

We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
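The entropy-based metric can be sketched as the reduction in Shannon entropy of a belief over remaining candidates once an answer filters them. The uniform-belief simplification and set-based filtering below are assumptions; the paper filters candidates via ConceptNet.

    import math

    def entropy(candidates: set) -> float:
        """Shannon entropy (bits) of a uniform belief over candidates."""
        return math.log2(len(candidates)) if candidates else 0.0

    def information_gain(before: set, consistent: set) -> float:
        """Bits gained by a question whose answer keeps only `consistent`."""
        after = before & consistent
        return entropy(before) - entropy(after)

    objects = set(range(64))
    print(information_gain(objects, set(range(32))))  # halving -> exactly 1 bit

A question that reliably halves the candidate set is thus worth one bit, which is what ties higher IG to shorter expected game length.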


How Good are Foundation Models in Step-by-Step Embodied Reasoning?

Dissanayake, Dinura, Heakl, Ahmed, Thawakar, Omkar, Ahsan, Noor, Thawkar, Ritesh, More, Ketan, Lahoud, Jean, Anwer, Rao, Cholakkal, Hisham, Laptev, Ivan, Khan, Fahad Shahbaz, Khan, Salman

arXiv.org Artificial Intelligence

Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.
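A toy sketch of per-step scoring that disentangles perceptual grounding from action reasoning, in the spirit of the framework described; the field names and exact-match criteria are hypothetical, not the benchmark's schema.

    def score_step(pred: dict, gold: dict) -> dict:
        """Grade perception and action separately so failures are attributable."""
        perception_ok = pred["observation"] == gold["observation"]
        action_ok = pred["next_action"] == gold["next_action"]
        return {
            "perception": float(perception_ok),
            "action": float(action_ok),
            # Credit a "grounded" action only when perception also held.
            "grounded_action": float(perception_ok and action_ok),
        }

Separating the two scores lets an evaluator distinguish a model that misreads the scene from one that sees correctly but plans an invalid or unsafe action.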


Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

Sun, Guangzhi, Manakul, Potsawee, Zhan, Xiao, Gales, Mark

arXiv.org Artificial Intelligence

Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation, injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model's predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with a refusal rate of over 90% and near-random choice-level uncertainty on probing questions, much higher than that achieved by obfuscation.
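The flattening objective can be sketched as a KL term that pushes the model's distribution over an MCQ's answer choices toward uniform. The tensor shapes, KL direction, and reduction below are assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def flatten_mcq_loss(choice_logits: torch.Tensor) -> torch.Tensor:
        """KL(uniform || model) over the answer choices of a generated MCQ.

        choice_logits: (batch, num_choices) model scores for each option.
        Minimizing this drives every option toward equal probability, i.e.
        random choice-level uncertainty about the unlearned fact.
        """
        log_probs = F.log_softmax(choice_logits, dim=-1)
        uniform = torch.full_like(log_probs, 1.0 / choice_logits.size(-1))
        # F.kl_div takes log-probabilities as input and probabilities as target.
        return F.kl_div(log_probs, uniform, reduction="batchmean")

Driving the choice distribution to uniform, rather than toward a decoy answer, is what separates removal from the obfuscation behaviour the paper critiques.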